The Bootstrap

Elizabeth King
Kevin Middleton

What is the bootstrap?

  • First formalized and named by Efron. The technique was named after “pulling yourself out of the mud by your bootstraps”.
  • A general resampling method: draw random samples with replacement from the observed dataset to approximate the sampling distribution of a statistic
    • Assumes the observed dataset is an adequate representation of the true distribution

The General Procedure

  1. Begin with the full dataset of size n
  2. Sample n values with replacement from the dataset
  3. Calculate the statistic of interest from the resampled data
  4. Repeat steps 2 and 3 B times to get B bootstrapped values
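
The four steps above can be sketched in base R. This is only a minimal illustration, assuming a made-up numeric sample and the mean as the statistic of interest; the names `x`, `B`, and `boot_means` are placeholders and do not come from the slides.

```r
set.seed(42)                       # for reproducibility only

x <- rnorm(15)                     # step 1: hypothetical dataset of size n = 15
n <- length(x)
B <- 1000                          # number of bootstrap replicates (assumed value)

# Steps 2-4: resample n values with replacement, compute the statistic,
# and repeat B times
boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))

length(boot_means)                 # B bootstrapped values
sd(boot_means)                     # bootstrap estimate of the standard error
```

Any other statistic (median, correlation, a diversity index) could replace `mean()` here, as long as it is computed on the resampled data each time.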

Bootstrap Sampling

  • Assumption: the observed distribution is representative of the true distribution
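library(tidyverse)

# Simulate 15 phenotype values, each labeled with a letter
dd <- tibble("phenotype" = rnorm(15),
             "id" = letters[1:15],
             "set" = rep(1, 15))
# Plot the observed values as labels along the first row (set = 1)
p1 <- dd |>
  ggplot(aes(phenotype, set, label = id)) +
  geom_label(position = position_jitter(height = 0.3, seed = 34)) +
  scale_y_continuous(NULL, breaks = NULL, limits = c(0.5, 3.5))

p1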

Bootstrap Sampling

# Sample the rows of dd with replacement to make a new dataset of the same n
boot1 <- dd[sample(seq(1, nrow(dd)), replace = TRUE), ]
# Label this set
boot1$set <- 2

# Add it to our plot
p2 <- p1 +
  geom_label(data = boot1,
             position = position_jitter(height = 0.3, seed = 34),
             color = "firebrick")

p2

Example: Shannon’s Diversity Index

Robinson et al. 2012. Butterfly community ecology: the influences of habitat type, weather patterns, and dominant species in a temperate ecosystem.

Rows: 9
Columns: 2
$ Species <chr> "Cercyonis pegala", "Colias philodice", "Erynnis persius", "Eu…
$ N       <dbl> 26, 10, 1, 143, 44, 1, 59, 17, 7

Example: Shannon’s Diversity Index

\[H = -\sum_{i=1}^{N_{species}} p_i \ln(p_i)\]

where \(p_i\) is the relative abundance of the \(i\)th species.

bb$relA <- bb$N / sum(bb$N)

-sum(bb$relA * log(bb$relA))
[1] 1.553839

Example: Shannon’s Diversity Index

library(boot)

# Create a vector with one entry per individual observation
bb_full <- rep(bb$Species, bb$N)

# Function for calculating the diversity index from a bootstrapped sample
shanH <- function(sp_list, indices) {
  sp_ab <- table(sp_list[indices]) / length(sp_list)
  return(-sum(sp_ab * log(sp_ab)))
}

# Get 1000 bootstrap samples
shan_bs <- boot(data = bb_full,
                statistic = shanH,
                R = 1000)

Confidence Interval: Standard Method

  • Assume the bootstrapped values of the statistic are approximately normally distributed
mu_b <- mean(shan_bs$t)
s_b <- sd(shan_bs$t)
cis <- c(mu_b + qnorm(0.975) * s_b, 
         mu_b - qnorm(0.975) * s_b)

cis
[1] 1.637101 1.438006
sum(shan_bs$t >= cis[1])
[1] 17
sum(shan_bs$t <= cis[2])
[1] 33

Confidence Interval: Percentile Method

cis <- c(quantile(shan_bs$t,0.975),
         quantile(shan_bs$t,0.025))

cis
   97.5%     2.5% 
1.632061 1.431249 
sum(shan_bs$t >= cis[1])
[1] 25
sum(shan_bs$t <= cis[2])
[1] 25
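
The `boot` package also provides `boot.ci()` for computing intervals directly from a `boot` object. The sketch below is self-contained and rests on some assumptions: it reuses the nine counts shown earlier, but the labels `sp1` to `sp9` are placeholders because the full species names are not shown, and the names `counts`, `individuals`, `shannon`, and `bs` are illustrative. The limits from `boot.ci()` may differ slightly from the `quantile()` version because it handles the order statistics differently.

```r
library(boot)

set.seed(1)

# Counts from the butterfly data; species labels are placeholders
counts <- c(26, 10, 1, 143, 44, 1, 59, 17, 7)
individuals <- rep(paste0("sp", seq_along(counts)), counts)

# Shannon index for the individuals selected by the bootstrap indices
shannon <- function(sp_list, indices) {
  p <- table(sp_list[indices]) / length(indices)
  -sum(p * log(p))
}

bs <- boot(data = individuals, statistic = shannon, R = 1000)

# Percentile interval; the limits are in columns 4 and 5 of $percent
ci <- boot.ci(bs, conf = 0.95, type = "perc")
ci$percent[4:5]
```

Other interval types can be requested through the `type` argument, for example `"norm"` or `"bca"`.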

Does bootstrapping work in this case?

  • Simulate populations with known true species compositions
  • Sample as you would in the field
  • Try this method
  • What percentage of time does the bootstrap confidence interval include the true Shannon index?
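
The steps above can be sketched as a small coverage simulation. Everything here is an assumption made for illustration: the five-species community in `true_p`, the sample size, the number of simulations, and the choice of the percentile interval. It is not the design used in the studies cited on the following slides.

```r
set.seed(2024)

# Hypothetical "true" community with known relative abundances
true_p <- c(0.5, 0.2, 0.15, 0.1, 0.05)
true_H <- -sum(true_p * log(true_p))

# Shannon index from a vector of integer species codes
shannon <- function(sp) {
  p <- tabulate(sp) / length(sp)
  p <- p[p > 0]                    # drop species absent from this sample
  -sum(p * log(p))
}

n_sims <- 200   # simulated field samples (kept small so the sketch runs quickly)
n_ind  <- 100   # individuals per field sample
B      <- 500   # bootstrap replicates per field sample

covered <- replicate(n_sims, {
  # Sample individuals from the known community, as in the field
  field <- sample(seq_along(true_p), size = n_ind, replace = TRUE, prob = true_p)
  # Bootstrap the Shannon index for this sample
  H_b <- replicate(B, shannon(sample(field, replace = TRUE)))
  ci <- quantile(H_b, c(0.025, 0.975))
  # Does the percentile interval contain the true value?
  ci[1] <= true_H && true_H <= ci[2]
})

mean(covered)   # proportion of intervals that include the true index
```

If the method worked perfectly, this proportion would be close to the nominal 0.95; a clearly lower value would suggest that the interval under-covers for this kind of data.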

Does bootstrapping work in this case?

Pla, L. 2004. Bootstrap Confidence intervals for the Shannon biodiversity index: A simulation study. J Agric Biol Environ Stat 9:42–56.

Does bootstrapping work in this case?

Palma et al. 2022. New confidence interval methods for Shannon index.

General Considerations

  • Bootstrap methods assume your observed dataset approximates the true distribution
    • This assumption is unlikely to hold for small samples
  • Every use case must be shown to be valid via simulation
  • Decisions
    • How many bootstrap samples (B)?
    • What CI method?
    • How to simulate data to test the method?

Common Use Cases

  • Confidence intervals (most common)
  • Hypothesis tests (less common)
    • Support for branches on phylogenetic trees (Felsenstein, J. 1985. Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution 39:783–791.)